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Abstract— THIS PAPER IS ELIGIBLE FOR THE STUDENT 
PAPER AWARD. In this work we present a flexible, proba- 
bilistic and reference-free method of error correction for high 
throughput DNA sequencing data. The key is to exploit the high 
coverage of sequencing data and model short sequence outputs 
as independent realizations of a Hidden Markov Model (HMM). 
We pose the problem of error correction of reads as one of 
maximum likelihood sequence detection over this HMM. While 
time and memory considerations rule out an implementation of 
the optimal Baum- Welch algorithm (for parameter estimation) 
and the optimal Viterbi algorithm (for error correction), we pro- 
pose low-complexity approximate versions of both. Specifically, 
we propose an approximate Viterbi and a sequential decoding 
based algorithm for the error correction. Our results show that 
when compared with Reptile, a state-of-the-art error correction 
method, our methods consistently achieve superior performances 
on both simulated and real data sets. 

1. Introduction 

DNA sequencing is the process of finding the identity and 
order of nucleotides or bases, adenine (A), guanine (G), cyto- 
sine (C) and thymine (T), in DNA molecules. It is used widely 
in biological and medical research to determine the genomes 
of diverse organisms ranging from microbes to humans. In 
recent years the advent of low cost, high throughput DNA 
sequencing IT] has made it feasible to sequence multiple 
individuals, even entire populations of organisms. This techno- 
logical advance may be the key to achieving truly personalized 
medicine, and several other grand goals in biology. 

The summed length of the DNA molecules that need to be 
sequenced can vary from a few thousand bases to hundreds of 
gigabases. Sequencing operates by breaking the DNA double 
helix molecules at random locations and generating incomplete 
"reads" that start at one end of the resulting fragments and read 
contiguous bases along one of the two strands. Read lengths 
are quickly increasing, but vary from thirty base pairs (bp) in 
the past to over a thousand bp on some platforms |2]. 

A critical issue in DNA sequencing is the elevated error rate 
in reads from the current technology [|3]. While the throughput 
rate is high, substitution errors, e.g., base A called as C, and 
insertion/deletion errors where spurious bases are included or 
valid bases are left out, are frequent. Errors in reads pose a 
serious problem for downstream uses of sequence data, includ- 
ing sequence assembly lIU, where the full-length sequence is 
inferred from the short reads, and variant identification El, for 
detecting genetic heterogeneity in a population. 

The primary approach for dealing with errors is to capitalize 
on the high throughput of the sequencing technology. Such 



an excess of fragments are sequenced that each base in 
the DNA molecules is covered by multiple reads. However, 
because the starting location of each read is random, there 
is no alignment information to indicate which reads cover a 
particular base. This lack of alignment information makes the 
problem different from classical error correction [0. Indeed 
if alignment information were available, the problem would 
roughly reduce to the decoding of a repetition code. 

Error correction of noisy reads has received significant 
attention in the bioinformatics community in recent years JT]- 
We briefly review the methods most closely related to our 
proposed method. Many methods begin by counting the oc- 
currence of all fcmers in the reads. A fcmer is a substring of 
length k. Euler JS] corrects a read via the smallest set of 
corrections that make all fcmers in the read common, i.e. have 
high occurrences. Hammer (Q identifies cliques by linking 
similar fcmers, then corrects all members to the clique's 
consensus fcmer FreClu iflO'i corrects full-length reads if it 
finds a significantly more frequent read differing at just one 
position. Quake flTl iteratively corrects bases by maximizing 
a posterior probability of the true sequence given the observed 
read until all fcmers are common. Two other methods use 
a probability model to correct a read ifTZl or fcmer lfT3l 
to the most likely true sequence. In all these methods the 
focus must turn to fcmers when the read length is long to 
guarantee sufficient repetition to distinguish error and true 
bases. So as read lengths increase, even read-based methods 
must become fcmer-based. All fcmer-based methods ignore the 
fact that fcmers are dependently read as contiguous substrings 
within reads. Moreover, while some allow arbitrarily complex 
error models, either all error parameters must be provided a 
priori or the parameter estimation procedure is ad hoc. 

The work of lfT4l . modeled a genome as the output of 
a discrete memoryless source and determined the coverage 
levels required to guarantee correct sequence assembly under 
a noiseless read process. Approaches based on statistical 
modeling of the sequencing process have been used ifTSl . lfT6l 
for basecalling, but not for error correction of reads. 
Main contributions. In this work we address the problem of 
correcting errors in noisy reads from a signal processing and 
error control coding viewpoint. We consider reads from the 
Illumina DNA sequencer that is known to exhibit substitution 
errors (but essentially no insertion/deletion errors) [il71 . We 
demonstrate that Illumina reads can be modeled as symbols 
emitted from a Hidden Markov Model (HMM). To overcome 



the unmanageably large state space, we use constraints and 
penalties to estimate the HMM parameters. Given the param- 
eters of this HMM, we pose the problem of error correc- 
tion of reads as a maximum likelihood sequence detection 
problem. While time and memory considerations rule out an 
implementation of the optimal Baum- Welch algorithm (for 
parameter estimation) and the optimal Viterbi algorithm (for 
error correction), we propose low-complexity approximate 
versions of both. This approach is successful in identifying 
many errors. In addition we propose a sequential decoding [flSl 
algorithm that achieves even better performance. Our results 
on real, publicly available sequencing data for the E. coli 
genome demonstrate a 9% improvement in error correction 
rates over a current state of the art technique. 

II. Problem Formulation 

Let Q denote the genome to sequence. It is a 4-ary sequence 
of length \Q\, where each letter is in = {A,C,G,T}; we 
call these letters bases or nucleotides. Sequencing operates by 
breaking the genome into fragments, from which length L 
reads are made. The sequencer has access to multiple copies 
of Q and produces up to billions of reads. The starting point of 
the fragment within Q is random and unknown. The sequencer 
processes the fragment and outputs a read {x,y), where x is 
the sequencer's best guess of L bases in the fragment and y 
are the corresponding quality scores; the quality scores are 
discrete measures of confidence in the base calls x. A given 
base location will typically be covered by multiple reads. Let 
N denote the total number of reads obtained in this process. 
The coverage level is defined to be NL/\Q\. 
A. HMM modeling 

We model the sequencer as a Hidden Markov Model (HMM); 
each read is an independent realization of the HMM. Given 
s, the unknown true read of length L, we define s[i] as the 
i-th character of s and s[i...j] as the substring from position 
i to j (both s[i] and s[j] are included). Let St = s[t-k+i...t] 
be the t-th state. In the discussion below we will also refer 
to the states as fcmers. Similarly, Xt ~ x[t-k+i...t] will be 
referred to as the tth observed fcmer and — y[t-k+i...t] 
will denote the corresponding quality scores. We model the 
sequencer as transitioning between the states St^i to St. On 
the fth transition it emits the output {x[t],y[t]). 

To specify the model completely, we need to define: 

> State space /C, where |/C| < 4'"'. 

• Transition distribution p{s[t+i]\st), where 



\a) = 1, Va G /C. 



Emission distribution ft{x[t],y[t] \ St) with 

ft{x[t],y[t] I St) = qt{y[t] I x[t],st)gt{x[t] \ st), (1) 
where we assume the following simple forms. 



qtiylt] I x[t],St) 



qmiyit]) x[t] = s[t], 

^qniylt]) x[t] ^ s[t], and 
gtixlt] I St) = l{st e M'^ixt)}gt{xlt] \ sit]), (2) 
with J2gt{P\l3')^l, V/3'ea 



Our modeling philosophy is guided by the following consid- 
erations. It is well recognized that nucleotides in genomes 
display strong local dependence, and Markov models, like the 
one we use for s, have long been used to model this depen- 
dence |fT9l . Like most error correction correction methods, we 
start with a simple error model. Both gt{f3 \ /3'), the probability 
of (mis)reading base as /3, and qtj{q),j = 0, 1, the quality 
score probability mass functions, depend on position t. We 
expect that qto{q) is shifted right of qti{q), for all t, because 
the sequencer should assign higher quality scores to error free 
bases. We defer discussion of the role of the indicator function 
!{•} in eq. (|2]l (see text near eq. (|3]l below). 

The choice of fcmer length fc and state space /C are guided 
by several considerations. In principle, given k, one could 
choose all possible states as /C, but even for moderate 
k (around 15), such fZ is too big. Thus, for a given set of 
reads, we restrict JC to contain only observed fcmers. Even 
though /C includes erroneous fcmers, we hope to identify them 
during estimation of the HMM. The choice of fc depends on 
two conflicting requirements. On the one hand, we want an 
accurate model. If fc is too small, say fc = 3, then each fcmer, 
say ATC, will exist in several locations in the original Q. Our 
model will tend to "correct" uncommon downstream bases to 
common downstream bases. For example, if ATCG occurs 
twice and ATCT occurs once, then true read ATCT may 
be erroneously corrected to ATCG. On the other hand, very 
large k (though it cannot exceed the read length L), may lead 
to decreased fcmer coverage and eventually an overparame- 
terized, inestimable model. Thus, there is an ideal value of 
fc that achieves uniqueness and an estimable model. In our 
experiments, we choose fc to optimize performance. 

To reduce computational complexity, we further constrain 
the emission distribution. As errors are relatively rare, we only 
allow fcmers within a small Hamming distance of an observed 
fcmer Xt to have non-zero emission distributions. If we define 
the d-yieighborhood of observed fcmer Xt as 



W^ixt) ^{w.-weJCa-nA D{xt, w) < d}, 



(3) 



where D{-,-) is the Hamming distance function, this as- 
sumption reduces the overall number of parameters since 
gt{x[t] I St) =0 if St iM'^ixt). 

The HMM is fit to the read data using the iterative 
Expectation-Maximization (EM) algorithm (Baum- Welch). 
Subsume all model parameters into vector 0. The EM locally 
maximizes the likelihood and produces parameter estimate 
b. The initial parameters for the EM are computed as 
follows. For transition probabilities, we count the occurrence, 
n{a.,f3), of a. followed by (3 in all reads and set. 



p(°)(/3|a) 



n{a.,l3) 



T,0'enri{oi,P')' 
As for the emission part, we initialize 



a e /c,/3 G rj. 



Q„ 
1 

4' 



q G {l,...,(3max} 



(4) 



(5) 



/3en 



where Qmax is the maximum quality score the sequencer can 
produce, j G {0, 1} indicates presence of an error, and t G 



{k, k + I, . . . , L} is read position. We tried multiple random 
initializations and found the EM insensitive to choice of 9^*^K 

B. Penalized Estimation 

While the HMM reasonably captures the local dependence 
present in genomic sequences and the error characteristics of 
modem sequencers, it fails to recapitulate the finite genome 
length. To impose our certainty that the vast majority of 
fcmers are unique, we use the approximate penalty proposed 



J{0) 
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log(l+p(/3|a)/7) 
log(l + 1/7) ' 



that penalizes small transition probabilities and drives them 
to zero. Given the set of observed reads TZ, the EM can be 
adapted to maximize the penalized log-likelihood 

i{e\n)-\j{e), 

where A and 7 are user specified tuning parameters. In general, 
increasing A or decreasing 7 strengthens the penalty. Desirable 
values for A and 7 could be determined by imposing a level 
of sparsity consistent with a prior estimate of genome length. 
In the experiments reported here, we seek a strong penalty 
to push small transition probabilities to zero and eventually 
eliminate fcmers with suspiciously low coverage. Therefore, 
we choose 7 = 10""' and vary A over the rough grid 
{100, 150, 200, 250, 300} to optimize performance. 

III. Error Correction Algorithm 

Given a fitted HMM for the data, we now discuss the actual 
error correction algorithm. One naturally turns to the Viterbi 
algorithm to estimate the maximum likelihood "true read" s, 
given the pair {x,y). However, the state space IC, even for 
modest k and relatively small genome size \Q\, is formidable 
and prevents exact Baum-Welch and Viterbi algorithms. Ac- 
cordingly, we use an approximate Viterbi-like decoding and 
sequential decoding as discussed below. 

A. An approximate Viterbi Algorithm 

For computational reasons, the Viterbi is limited, like the 
HMM, to only consider true sequences, s, constrained by our 
assumptions on the emission distribution. While it propagates 
likelihoods of survivor paths, if a survivor path contains a state 
St differing at more than d locations from Xt, then state St is 
deemed implausible and the survivor path is not extended. The 
same restriction is applied during the Baum-Welch parameter 
estimation described in Section III-AI We call this decoding 
method A- Viterbi. 

B. Sequential Decoding on HMM with Fano Algorithm 

Sequential decoding was proposed as a way to decode convo- 
lutional codes (prior to the optimal Viterbi algorithm) lITSl . 
Il2n . For codes with high constraint lengths, it serves as 
a good low-complexity alternative to the Viterbi algorithm. 
In our work we adapt the Fano algorithm for determining 
the maximum likelihood state sequence in the HMM. Our 
discussion here is based on the description in [22] (see Fig. 
[TJ. In the Fano algorithm at any given stage there is only one 
active path, where a path is defined to be a sequence of states 



Si, . . . , S( that have a non-zero probability of occurrence. Let 
the path labels of the predecessor path, the current path, and 
the successor path be Up, Uc, and Ug. The corresponding Fano 
metrics are denoted as Mp, Mc, and Ms- A given candidate 
successor path i/,, corresponds to appending a new state to 
Uc such that the new state has a positive probability of being 
reached from the last state of Uc. The probability of choosing 
Us as the successor path can be computed from the transition 
distribution of the HMM; we denote it as a{i'c, ^s) below. 
Likewise the emission distribution specifies the probability 
of the emitted base and quality score corresponding to this 
transition; this is denoted by below (to avoid complicated 
notation). The Fano metrics are updated as follows. 



Ms = Mc + log2 [aiuc.Us)] + log2(6) + B. 



(6) 



Here B represents the bias whose value is chosen with the 
purpose that the Fano metric will keep increasing as long as 
we are on the correct path ll22l . 

IV. Experimental Results 

We compared the performance of the A- Viterbi and Fano 
decoding algorithms to Reptile on one simulated and one real 
dataset (Table |l]). Reptile is a top-performing method in a 
recent survey of error correction methods [0. All the results 
we present here assume we know the first true fcmer. Since 
we use simulation or resequencing experiments of known 
genomes, we can reliably infer this information. In practice, 
a known primer sequence is often attached to both ends of 
the DNA fragment being sequenced, so the assumption is not 
restrictive. All methods include model complexity parameters, 
such as k and d, that can be difficult to choose when the 
true sequence is unknown. In our experiments, with the true 
sequence available, we can tune these parameters for optimal 
performance. We emphasize that we have tuned all methods 
to achieve their respective best performance. 

For each dataset, maximum likelihood estimates of the 
HMM parameters were estimated using the penalized like- 
lihood (7 = 0.0001, A = 250) for various fcmer lengths, 
k = 13, 14, 15, and maximum Hamming distance, d = A. 
Then, A- Viterbi and Fano were used to perform the decoding 
at each chosen fc. The A- Viterbi algorithm was run using the 
same d used to estimate the HMM parameters. For the Fano 
algorithm, parameter B in Eq. (|6]l was set to be 2 and 10 
for the simulated and real dataset, respectively; we tried both 
A = 0.5 or 1.0. Note, the Fano algorithm places no constraints 
on the fcmer Hamming distances. 

Reptile uses a "tile" formed by concatenating two fcmers 
of length fc with overlap of length fc — step. Corrections 
are made if the observed tile is uncommon and there is a 
substantially more common tile in the neighborhood of the 
observed tile. The tile neighborhood is formed by allowing 
up to d errors in each fcmer Tile counts are computed from 
high quality reads only. We chose the best parameters (fc, step) 
using a grid-search over 7 < step < fc < 12. Given {k,step), 
thresholds for error correction decisions were automatically 
selected, following the instructions in the software manual and 
MaxBadQPerKmer was left at the default 4. The maximum 



Fig. 1 Sequential Decoding with Fano Metric on HMM 
Input: fcmers obtained from the read sequence, step size A, 

bias B, parameters for HMM 
Output: corrected read (or path v*) and Fano metric M* 
Initialization: threshold T = 0, Vp = dummy, Mp = 
— oo, Vc= fcmer in the first stage, Mc = 0, stage < = 1. 
1: Choose the successor path which has the largest Fano 
metric based on the transition probability of Vc and 
emission probability of the t + fcth base. Denote this path 
label as Vs and corresponding Fano metric as Ms- 
1: if Ms > T then 

3: Move one base forward and update 

i^p ~ i^c, Mp ~ Mc, Vc = i^s, Mc ~ Ms', set t f+1 

4: itt = L-k + l tlien 

5: Stop algorithm and output v>* , M* = Mc 

6: else 

7: if Mp < r + A then 

8: Tighten threshold T such that T < Mc < T + A. 

Go to step [U 

9: end if 
10: Go to Step [1] 
11: end if 
12: else 

13: a Mp>T then 

14: Move one base back and update 

Vs = i^c, Ms = Mc, Uc = Vp, Mc = Mp-, i t - 1 

and re-compute Mp and i/p. 
15: Attempt to find the non-visited successor path of Vc 

which has the largest Fano metric. Denote this path 

as Vt and its Fano metric as Mt- 
16: if Vt is empty then 
17: Go to Step [B] 

18: else 

19: Update Us = v>t, Ms = Mt and go to Step|2] 

20: end if 
21: else 

22: Lower threshold as T T - A and go to Step [T] 
23: end if 
24: end if 



TABLE 1 

Benchmark sequencing datasets 





Genome 


Read 


Number 




Error 


Dataset 


length 


length (bp) 


of reads 


Coverage 


rate (%) 


Dl 


250000 


36 


1000000 


144.0X 


1.23 


D2 


500000 


36 


2132517 


153.5X 


0.51 



errors allowed per fcmer was set to d = 4. All remaining 
parameters were left at their defaults. 

Let e be the total number of ground truth errors in the 
sequencing reads excluding those in the first fcmer (or tile 
for Reptile). The probability of error correction is defined as 
( = ce/e and gain is defined as ?/ = (ce — fa)/e, measuring 
the effective number of errors removed from the dataset [l23l . 

A. Dl: Simulated Dataset 

To create the simulated dataset, one million reads were 
generated by randomly sampling 36bp sequences from a 



TABLE 11 
Error Correction Results for D 1 







fct 


ce 


c 


fa 


V 






13 


430391 


0.9981 


100 


0.9979 


Fano 


A = 0.5 


14 


428324 


0.9993 


21 


0.9993 






15 


425442 


0.9996 


6 


0.9996 






13 


430384 


0.9967 


227 


0.9962 


A-Viterbi 




14 


427839 


0.9978 


102 


0.9975 






IS 


424881 


0.998 


75 


0.9979 


Reptile 




(8, 8) 


355395 


0.8423 


3900 


0.8331 




(9, 9) 


405971 


0.9848 


678 


0.9832 






(10, 10) 


314306 


0.7867 


708 


0.7849 



t : For Reptile, this column is reported as (fc, step). 



250Kbp region (lOOOKbp — 1250Kbp) of the E. coli genome 
(Accession NC.000913). From the real dataset (see lIV-Bb , 
we estimated empirical distributions of quality scores given 
read positions, and used these to generate quality scores 
for every position of each simulated read. We assumed the 
simulated quality scores indicated the true error probabilities 
and replaced the true base f3t at position t with error base 
/?( 7^ Pt with probability ^ — where qt is the quality score. 

Table shows the error correction results for various 
choices of fcmer length, fc, or (fc, step) for Reptile. In bold, 
we show the best performance for each method, as measured 
by the gain metric rj along with the results for a few additional, 
nearby settings. Both Fano and A-Viterbi outperform Reptile 
by achieving higher error correction probabilities while having 
a lower false alarm probability. Overall, the Fano algorithm is 
the best performer at fc = 15 and A = 0.5. 

The HMM-based approaches and Reptile exhibit best per- 
formance at different values of fc, but Reptile is much more 
sensitive to this choice. In a reference-free error correction 
setup, when no explicit ground truth is available to guide 
choice of fc, the HMM-based approaches have the advantage 
of yielding robust performance over a wider range of fc. 
Quake ifTTI recommends choosing fc such that w 0.01, 
which suggests fc = 13 in this case. While Fano and A-Viterbi 
are not at their peak performance for this choice of fc, their 
performance is near-optimal. In contrast, the Reptile authors 
recommend fc = log4 \Q\ ?» 9, which indeed works well. 

B. D2: Real Experimental Dataset 

To test the performance of our model on a real Illumina 
dataset, we used the data of an E. coli resequencing experiment 
(Accession SRX000429). Knowing the reference genome al- 
lows us to identify the "ground truth" errors as long as we can 
identify the position of the read in the reference genome. To 
align the reads to the reference genome, we used the Burrows- 
Wheeler Aligner ||241 with default parameters. We selected all 
reads with a unique match to a 500Kbp region (1000Kb — 
ISOOKbp) on the reference genome and tallied the true errors 
as mismatches between the selected reads and the reference 
sequence. There are cases of erasure where the nucleotide is 
recorded as "N" for "not determined" in the reads. For Reptile 
and A-Viterbi, all N bases were replaced by A, but were left 
intact for Fano, which can directly correct base N . 

The error correction results on dataset D2 are summarized in 
Table |III] The superiority of the Fano algorithm is accentuated 



TABLE m 
Error Correction Results for D2 







k 


ce 


c 


fa 


V 






13 


354864 


0.9341 


7866 


0.9134 


Fano 


A = 0.5 


14 


353209 


0.9397 


5589 


0.9248 






15 


348592 


0.9388 


5348 


0.9244 






13 


357945 


0.921 


6950 


0.9026 


A-Viterbi 




14 


350316 


0.9172 


5076 


0.9036 






15 


344943 


0.9152 


4756 


0.9024 






(8, 8) 


270226 


0.7461 


39280 


0.6377 


Reptile 




(9, 9) 


314154 


0.8935 


20748 


0.8345 






(10, 10) 


250480 


0.7973 


8924 


0.7130 



in this real data. We presume that Fano outperforms A-Viterbi 
by exploring parts of the 4-ary tree ruled out by A-Viterbi due 
to the neighborhood constraint (cf. Section ITlI-Ab . The Quake- 
recommended k for this dataset is 14; Reptile recommends 9. 



C. Parameter Selection and Sensitivity 

While not as sensitive as Reptile to choice of k, the 
HMM-based methods are sensitive to other parameter choices. 
We mention only the strongest effects here as we have not 
yet completed a thorough analysis. Most striking, A-Viterbi 
achieved gain of only 0.857 with 7 = 0.001 and A = 500. 
The Fano algorithm is most sensitive to B. In the real dataset, 
the HMM parameters were such that a larger value of the 
bias B needed to be chosen; otherwise there was excessive 
backtracking in the running of the algorithm. 

Since we had roughly optimized the A-Viterbi and Fano 
algorithms over A, 7, A, and B, we attempted to optimize 
Reptile over two of its parameters. Through much experimen- 
tation, we could improve Reptile performance to 77 = 0.9234 
when T_expGoodCnt= 28 and T_card= 13 on the real 
dataset, which is very close yet marginally inferior, to the best 
Fano performance. However, neither HMM-based method was 
allowed quite the same diligent exploration of its parameter 
space. It is clearly imperative that we develop methods to 
choose A, 7, A, and B without reference to a test genome, so 
all methods can be compared on equal footing. Our current set 
of experiments leads us to conclude that HMM-based methods 
are substantially easier to tune, less sensitive to parameter 
settings, and more flexible when it comes to adopting more 
realistic emission distributions. 

V. Conclusion 

We establish the HMM for the problem of noisy DNA 
read correction and develop the approximate Viterbi algorithm 
and the sequential decoding algorithm with Fano metric to 
execute the error correction. Based on our test results for 
both simulated and real data, the proposed algorithms often 
outperform another state of the art method in this field. Our 
future work will include development of systematic procedures 
for choosing the parameters of the algorithms, an investigation 
of more complex, context dependent error emission distribu- 
tions and an exhaustive comparison with respect to competing 
methods. 
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